Retrieval Augmented Generation
When Evidence Contradicts: Toward Safer Retrieval-Augmented Generation in Healthcare
Saeedeh Javadi, Sara Mirabi, Manan Gangar, Bahadorreza Ofoghi
In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet, this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are three-fold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions, ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence, and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.
- Health & Medicine (1.00)
- Media > News (0.48)
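A contradiction-aware filtering step of the kind this abstract calls for could be sketched as follows. This is an illustrative assumption, not the authors' method: the `Passage` fields, the newer-evidence-wins rule, and the toy negation check all stand in for a real pipeline, where `contradicts` would be an NLI model scoring passage pairs.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Passage:
    text: str
    year: int
    score: float  # retrieval similarity

def filter_contradictions(
    passages: List[Passage],
    contradicts: Callable[[str, str], bool],
) -> List[Passage]:
    """Drop a passage when a *newer* retrieved passage contradicts it.

    One plausible contradiction-aware strategy for temporally
    inconsistent evidence; `contradicts` is a placeholder for an
    NLI model.
    """
    kept = []
    for p in passages:
        newer = [q for q in passages if q.year > p.year]
        if any(contradicts(p.text, q.text) for q in newer):
            continue  # outdated claim superseded by newer evidence
        kept.append(p)
    # Highest-similarity passages first, as a plain retriever would rank them.
    return sorted(kept, key=lambda p: p.score, reverse=True)

# Toy contradiction check: flags an explicit negation of the same claim.
def toy_contradicts(a: str, b: str) -> bool:
    return a.replace("is not", "is") == b or b.replace("is not", "is") == a

docs = [
    Passage("drug X is not safe in pregnancy", 2005, 0.91),
    Passage("drug X is safe in pregnancy", 2021, 0.90),
]
print([p.year for p in filter_contradictions(docs, toy_contradicts)])  # [2021]
```

Note that retrieval similarity alone ranks the outdated 2005 passage first (0.91 vs 0.90), which is exactly the failure mode the abstract describes; the filter removes it because newer evidence contradicts it.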
Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation
Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, Vrynsia Vrynsia, Koustav Ghosal, Maraim Masoud, Riccardo Mattivi
Retrieval-Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain-specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialized agents. The proposed system supports intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross-encoder-based context re-ranking. We evaluate our approach against a standard RAG baseline using a curated dataset of 85 question--answer--reference triples derived from an enterprise fintech knowledge base. Experimental results demonstrate that the agentic RAG system outperforms the baseline in retrieval precision and relevance, albeit with increased latency. These findings suggest that structured, multi-agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain-specific settings.
- Asia > Middle East > UAE (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
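The pipeline stages named in the abstract (acronym resolution, sub-query decomposition, context re-ranking) can be sketched minimally. Everything here is a placeholder assumption: the glossary is invented, the split-on-"and" decomposition stands in for the keyphrase-guided agent, and the token-overlap scorer stands in for the cross-encoder.

```python
# Assumed fintech glossary; a real system would use a curated ontology.
ACRONYMS = {"KYC": "know your customer", "NAV": "net asset value"}

def resolve_acronyms(query: str) -> str:
    # Expand known acronyms in place, leaving other words untouched.
    words = []
    for w in query.split():
        key = w.strip("?,.").upper()
        words.append(ACRONYMS.get(key, w))
    return " ".join(words)

def decompose(query: str) -> list[str]:
    # Trivial stand-in for iterative, keyphrase-guided sub-query
    # decomposition: split compound questions on "and".
    return [q.strip() for q in query.split(" and ")]

def rerank(candidates: list[str], query: str) -> list[str]:
    # Token-overlap scoring standing in for cross-encoder re-ranking.
    q = set(query.lower().split())
    return sorted(candidates, key=lambda c: len(q & set(c.lower().split())), reverse=True)

query = "What is NAV and how does KYC work?"
subqueries = [resolve_acronyms(q) for q in decompose(query)]
print(subqueries)  # ['What is net asset value', 'how does know your customer work?']

docs = ["the net asset value is computed daily", "customer onboarding steps"]
print(rerank(docs, subqueries[0])[0])  # the net asset value is computed daily
```

The design point the paper makes survives even this toy version: expanding "NAV" before retrieval lets lexically dense domain documents match the user's intent.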
Steering Over-refusals Towards Safety in Retrieval Augmented Generation
Utsav Maskey, Mark Dras, Usman Naseem
Safety alignment in large language models (LLMs) induces over-refusals -- where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and retrieved context properties influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement / contamination, domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers the embedding regions towards the confirmed safe, non-refusing output regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Ukraine (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.70)
- Health & Medicine > Therapeutic Area (0.68)
- Banking & Finance (0.68)
- North America > United States (0.30)
- Asia > China > Shanghai > Shanghai (0.06)
- Asia > Singapore (0.05)
- Asia > Singapore (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Indonesia > Bali (0.04)
- (5 more...)
- Media (0.46)
- Banking & Finance (0.46)
- Leisure & Entertainment (0.46)
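One common way to realize an embedding intervention like the one described above is additive activation steering. This sketch assumes the steering vector is the difference between mean "safe answer" and mean "over-refusal" embeddings, which may differ from SafeRAG-Steering's actual construction; vectors are plain Python lists for clarity.

```python
def mean(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(safe_embs, refusal_embs):
    # Assumed construction: difference of class means (safe minus refusal).
    safe_mu, ref_mu = mean(safe_embs), mean(refusal_embs)
    return [s - r for s, r in zip(safe_mu, ref_mu)]

def steer(h, v, alpha=1.0):
    # Shift a hidden state toward the safe, non-refusing region at
    # inference time; alpha controls intervention strength.
    return [hi + alpha * vi for hi, vi in zip(h, v)]

safe = [[1.0, 0.0], [0.8, 0.2]]     # toy embeddings of non-refusing outputs
refuse = [[0.0, 1.0], [0.2, 0.8]]   # toy embeddings of over-refusals
v = steering_vector(safe, refuse)
print([round(x, 3) for x in steer([0.5, 0.5], v, alpha=0.5)])  # [0.9, 0.1]
```

Because the shift is applied only at inference time, the base model's alignment is untouched; harmful queries far from the contaminated-context region still elicit legitimate refusals.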
ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation
Reza Yousefi Maragheh, Pratheek Vadla, Priyank Gupta, Kai Zhao, Aysenur Inan, Kehui Yao, Jianpeng Xu, Praveen Kanumala, Jason Cho, Sushant Kumar
Retrieval-Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG-based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval-Augmented Generation framework for personalized recommendation, which integrates a multi-agent collaboration mechanism into the RAG pipeline. To better capture both long-term and session-level user behavior, ARAG leverages four specialized LLM-based agents: a User Understanding Agent that summarizes user preferences from long-term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and the inferred intent, a Context Summary Agent that condenses the NLI Agent's findings, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG across three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency-based baselines, achieving up to a 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also conduct an ablation study to analyze the contribution of each ARAG component. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval-augmented recommendation and provide new directions for LLM-based personalization.
- North America > United States > California > Santa Clara County > Sunnyvale (0.06)
- North America > United States > Washington > King County > Bellevue (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
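The metrics reported above (NDCG@5 and Hit@5) have standard definitions that can be computed directly. This is a generic implementation with binary relevance, not the authors' evaluation code.

```python
import math

def ndcg_at_k(ranked, relevant, k=5):
    # DCG with binary gains: a hit at rank i contributes 1/log2(i+2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    # Ideal DCG: all relevant items packed at the top of the list.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

def hit_at_k(ranked, relevant, k=5):
    # 1.0 if any relevant item appears in the top k, else 0.0.
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

ranked = ["b", "a", "c", "d", "e"]  # model's ranking
relevant = {"a"}                    # ground-truth relevant item
print(round(ndcg_at_k(ranked, relevant), 4))  # 0.6309
print(hit_at_k(ranked, relevant))             # 1.0
```

In this toy case the relevant item sits at rank 2, so Hit@5 is perfect while NDCG@5 penalizes it relative to a rank-1 placement, which is why the paper reports both.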
Lessons from A Large Language Model-based Outdoor Trail Recommendation Chatbot with Retrieval Augmented Generation
Julia Ann Mathew, Suining He
The increasing popularity of outdoor recreational activities (such as hiking and biking) has boosted the demand for conversational AI systems that provide informative and personalized suggestions on outdoor trails. Challenges arise around (1) how to provide accurate outdoor trail information via conversational AI; and (2) how to enable usable and efficient recommendation services. To address the above, this paper discusses the preliminary and practical lessons learned from developing Judy, an outdoor trail recommendation chatbot based on a large language model (LLM) with retrieval augmented generation (RAG). To gain concrete system insights, we performed case studies with outdoor trails in Connecticut (CT), US. We conducted web-based data collection, outdoor trail data management, and LLM performance studies on the RAG-based recommendation. Our experimental results demonstrate the accuracy, effectiveness, and usability of Judy in recommending outdoor trails based on the LLM with RAG.
- North America > United States > Connecticut (0.26)
- Europe > Germany > Hamburg (0.04)
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets
Lorenz Brehme, Thomas Ströhle, Ruth Breu
Retrieval-Augmented Generation (RAG) has advanced significantly in recent years. The complexity of RAG systems, which involve multiple components (such as indexing, retrieval, and generation) along with numerous other parameters, poses substantial challenges for systematic evaluation and quality enhancement. Previous research highlights that evaluating RAG systems is essential for documenting advancements, comparing configurations, and identifying effective approaches for domain-specific applications. This study systematically reviews 63 academic articles to provide a comprehensive overview of state-of-the-art RAG evaluation methodologies, focusing on four key areas: datasets, retrievers, indexing and databases, and the generator component. We observe the feasibility of an automated evaluation approach for each component of a RAG system, leveraging an LLM capable of both generating evaluation datasets and conducting evaluations. In addition, we found that further practical research is essential to provide companies with clear guidance on the do's and don'ts of implementing and evaluating RAG systems. By synthesizing evaluation approaches for key RAG components and emphasizing the creation and adaptation of domain-specific datasets for benchmarking, we contribute to the advancement of systematic evaluation methods and the improvement of evaluation rigor for RAG systems. Furthermore, by examining the interplay between automated approaches leveraging LLMs and human judgment, we contribute to the ongoing discourse on balancing automation and human input, clarifying their respective contributions, limitations, and challenges in achieving robust and reliable evaluations.
In recent years, Large Language Models (LLMs) have made significant progress in research and have grown increasingly popular [1]. However, LLMs face several challenges, including issues with hallucinations caused by insufficient context [2], as well as limitations in their learned content, which prevent them from addressing questions requiring specific or proprietary information [1].
- Europe > Austria > Tyrol > Innsbruck (0.05)
- Europe > Switzerland (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
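The component-wise automated evaluation the survey envisions can be outlined as a small harness. The token-overlap faithfulness judge below is a stand-in assumption for an actual LLM judge, since the survey does not prescribe a specific judge model or prompt.

```python
def judge_faithfulness(answer: str, context: str) -> float:
    # Stand-in judge: fraction of answer tokens supported by the
    # retrieved context. A real harness would prompt an LLM judge here.
    ans, ctx = answer.lower().split(), set(context.lower().split())
    return sum(t in ctx for t in ans) / len(ans) if ans else 0.0

def evaluate_rag(samples):
    # samples: list of (question, retrieved_context, generated_answer)
    # triples; returns the mean faithfulness over the dataset, scoring
    # the generator component against the retriever's output.
    scores = [judge_faithfulness(answer, context) for _, context, answer in samples]
    return sum(scores) / len(scores)

samples = [
    ("What is RAG?",
     "rag grounds llm answers in retrieved documents",
     "rag grounds answers in retrieved documents"),
]
print(evaluate_rag(samples))  # 1.0
```

Separating the judge from the harness mirrors the survey's point about balancing automation and human input: the same `evaluate_rag` loop can take scores from an LLM judge, a heuristic, or a human annotator.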
Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and non-reasoning LLMs such as Gemini-Flash 2.0, o3-mini, o4-mini, and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and whether reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Germany > Bavaria > Regensburg (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
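The generate-evaluate-refine loop studied above follows a common pattern that can be sketched generically. Here `toy_generate` and `toy_critique` are hypothetical stand-ins for LLM calls; the toy critic enforces a BioASQ-style yes/no answer prefix, which is an assumption for illustration only.

```python
def self_refine(question, generate, critique, max_rounds=3):
    # Generate an initial answer, then loop: critique it, and if the
    # critic returns feedback, regenerate conditioned on that feedback.
    answer = generate(question, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        if feedback is None:  # critic is satisfied; stop early
            break
        answer = generate(question, feedback=feedback)
    return answer

# Hypothetical stand-ins for LLM calls.
def toy_generate(question, feedback):
    # Only fixes its formatting once the critic complains.
    return "yes, aspirin inhibits COX" if feedback else "aspirin inhibits COX"

def toy_critique(question, answer):
    # Demands a yes/no prefix, as in BioASQ yes/no answer types.
    return None if answer.startswith(("yes", "no")) else "start with yes/no"

print(self_refine("Does aspirin inhibit COX?", toy_generate, toy_critique))
# yes, aspirin inhibits COX
```

The `max_rounds` cap matters in practice: the paper's mixed results suggest self-feedback does not always converge, so an unbounded loop could oscillate between equally flawed drafts.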